Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIPCode: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [1]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 uszipcode==1.0.1 sqlalchemy_mate==1.4.28.4 -q --user
  Preparing metadata (setup.py) ... done
  Building wheel for atomicwrites (setup.py) ... done
  WARNING: The scripts f2py, f2py3 and f2py3.10 are installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 1.5.3 which is incompatible.
google-colab 1.0.0 requires pandas==2.1.4, but you have pandas 1.5.3 which is incompatible.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.4.54 which is incompatible.
mizani 0.11.4 requires pandas>=2.1.0, but you have pandas 1.5.3 which is incompatible.
pandas-stubs 2.1.4.231227 requires numpy>=1.26.0; python_version < "3.13", but you have numpy 1.25.2 which is incompatible.
plotnine 0.13.6 requires pandas<3.0.0,>=2.1.0, but you have pandas 1.5.3 which is incompatible.
xarray 2024.9.0 requires pandas>=2.1, but you have pandas 1.5.3 which is incompatible.

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are installed to successfully execute the code in this notebook.

In [1]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# libraries necessary for model building
# import libraries to split data into training and test sets
from sklearn.model_selection import train_test_split

# import libraries to build decision tree models
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to compute tree classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
)

# import libraries to tune different models
from sklearn.model_selection import GridSearchCV

# import libraries to perform data preprocessing
from sklearn.preprocessing import StandardScaler

# import libraries to interpret Zipcode values
from uszipcode import SearchEngine

# to suppress unnecessary warnings
# import warnings
# warnings.filterwarnings('ignore')
/root/.local/lib/python3.10/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

Loading the dataset¶

In [2]:
# run the following lines for Google Colab authorization via a dialog window
from google.colab import drive
drive.mount('/content/drive')

# loading the dataset via the file and its directory
folder_path = '/content/drive/MyDrive/AI-ML Post Graduate/Data Documents/'
file_name = 'Loan_Modelling.csv'
dataset = pd.read_csv(folder_path+file_name)
Mounted at /content/drive

In order to avoid altering the original data, we will make a copy of the dataset and use it throughout this notebook.

In [3]:
# making a copy of the dataset
data = dataset.copy()
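As a quick illustration of why the copy matters: `DataFrame.copy()` is deep by default, so edits to the working frame leave the original untouched. A minimal sketch on a toy frame (not the actual dataset):

```python
import pandas as pd

# toy frame standing in for the loaded dataset
original = pd.DataFrame({"x": [1, 2]})
working = original.copy()  # deep copy by default

working.loc[0, "x"] = 99     # mutate the copy
print(original.loc[0, "x"])  # → 1 (original untouched)
```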

Data Overview¶

First, we shall observe the contents of our dataset to get an overview of the data.

In [4]:
# observing the first and last 5 rows of the dataset
data
Out[4]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

5000 rows × 14 columns

In [5]:
# to obtain the number of rows and columns of the dataset
rows, columns = data.shape
print(f'The dataset has {rows} rows and {columns} columns.')
The dataset has 5000 rows and 14 columns.
In [6]:
# let us visualize the datatype of the columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Observations:

  • With the exception of CCAvg, all the columns are whole numbers (type int64).
  • CCAvg is the only continuous (float64) column in the dataset.
In [7]:
# we shall visualize the statistical summary of the columns
data.describe(include="all").T
Out[7]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Observations:

  • Only ~9.6% of customers accepted the personal loan in the last campaign.
  • The age of the customers ranges from 23 to 67, while the average is ~45.
  • The average income is ~\$74k, with a median of \$64k.
  • The average monthly credit card spending is ~\$2k.
  • Customers' family sizes range from 1 to 4.
  • Around 60% of the customers utilize online banking facilities.

Note: More observations and inferences can be made; however, since this is simply an overview, further inquiry will be done in the Exploratory Data Analysis.
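Because the target is 0/1-coded, its mean is the acceptance rate itself; a minimal sketch on a toy frame standing in for `data` (the real dataset has 5000 rows, of which 480, i.e. 9.6%, accepted):

```python
import pandas as pd

# toy stand-in: 1 acceptance out of 10 customers
toy = pd.DataFrame({"Personal_Loan": [0] * 9 + [1]})

# mean of a 0/1 column = proportion of 1s = acceptance rate
acceptance_rate = toy["Personal_Loan"].mean()
print(f"Acceptance rate: {acceptance_rate:.1%}")  # → Acceptance rate: 10.0%
```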

Before we investigate further on this data, we have to proceed with sanity checks.

Checking for missing values¶

In [8]:
# visualizing missing values by the sum of NaN values
# since all the columns are numerical types
data.isna().sum()
Out[8]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0

In [9]:
# Let's observe if there are duplicated rows
dups = data.duplicated().sum()
print(f"There are {dups} duplicated rows")
There are 0 duplicated rows
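Had any duplicates been found, `drop_duplicates()` would remove them; a hypothetical sketch on a toy frame:

```python
import pandas as pd

# toy frame with one fully duplicated row
toy = pd.DataFrame({"ID": [1, 1, 2], "Age": [25, 25, 45]})

deduped = toy.drop_duplicates()  # keeps the first occurrence by default
print(len(deduped))  # → 2
```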

Exploratory Data Analysis.¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

Let us first observe the distribution of the columns.

In [10]:
# observing the distribution of the columns via a histogram

# setting the size of the overall figure
plt.figure(figsize=(15, 10))

# making a list of all the data columns
# excluding ID and ZIPCode since their numerical value
# does not provide any meaningful insight
data_col = data.drop(columns=["ID", "ZIPCode"]).columns.to_list()

# plotting the histogram for each attribute of the dataset
# and appending it to the overall figure
for i, attribute in enumerate(data_col):
  plt.subplot(3,4, i+1)
  sns.histplot(data = data, x = attribute)

# showing the figure tightly
plt.tight_layout()
plt.show()

Observations:

  • Income and CCAvg hold right-skewed distributions, with most people earning less than 100k dollars and spending less than ~3k dollars.
  • The majority of people do not have any house mortgage value. Similarly, a majority of people did not accept the personal loan, do not have securities accounts, do not have a certificate of deposit, and do not use another bank's credit card.
  • The Age and Experience columns hold multimodal distributions: Age peaks at approximately 30, 39, 51, and 59 years, while Experience peaks at approximately 5, 11, 20, 29, and 35 years of professional experience.
  • Most people make use of online banking facilities.
In [11]:
# observing the distribution of the columns via a boxplot

# setting the size of the overall figure
plt.figure(figsize=(15, 10))

# making a list of all the data columns
# excluding ID, ZIPCode and the binary columns (those that hold either 0 or 1)
# since their numerical value does not provide any meaningful insight
data_col = data.drop(columns=["ID", "ZIPCode", "Personal_Loan", "Securities_Account", "CD_Account", "Online", "CreditCard"]).columns.to_list()

# plotting the boxplot
# for each attribute of the dataset from the list
# and appending it to the overall figure
for i, attribute in enumerate(data_col):
  plt.subplot(3,3, i+1)
  sns.boxplot(data = data, x = attribute)

# showing the figure tightly
plt.tight_layout()
plt.show()

Observations:

  • As seen previously, 75% of customers have an income below 100k dollars. Similarly, CCAvg shows 75% of customers spending below 3k dollars on credit cards. However, we can start to see outliers in both of these plots, primarily in credit card spending.
  • Since most people do not have a house mortgage, those with values over 250k appear as outliers.
  • 50% of customers are older than 45 years of age. Equally, 50% of the customers hold more than 20 years of professional experience.
  • The median family size is 2.

Thanks to this visualization, we have started to observe outliers. These outliers can represent either erroneous values, or simply a small percentage of values compared to the overall data. The treatment for these will be done in Data Preprocessing.
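The outlier criterion that boxplot whiskers use is the 1.5×IQR rule, and it can be expressed as a reusable mask; a minimal sketch (the helper name is ours, not from the notebook):

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series) -> pd.Series:
    """Flag values beyond 1.5 * IQR from the quartiles (boxplot whisker rule)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# toy series standing in for a column such as CCAvg
toy = pd.Series([1, 2, 2, 3, 3, 4, 50])
print(toy[iqr_outlier_mask(toy)].tolist())  # → [50]
```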

Before moving on to multivariate analysis, we shall answer one of the given questions.

How many customers have Credit Cards?¶

We can derive the answer to this question by excluding those who do not spend money on credit cards.

In [12]:
# Extracting a smaller dataframe that
# holds those with an average credit card spending of 0.
# From that dataframe we extract the number of rows, which
# represents the number of customers without credit cards.
customers = data[data["CCAvg"] == 0].shape[0]

# we subtract this number from the overall customer amount
customers = data.shape[0] - customers

# Printing the number of customers with credit cards
print(f"There are {customers} customers with credit cards.")
There are 4894 customers with credit cards.
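The same count can be obtained in one step by summing a boolean mask, with no intermediate dataframe; a sketch on a toy frame standing in for `data`:

```python
import pandas as pd

# toy stand-in: two customers with nonzero card spending
toy = pd.DataFrame({"CCAvg": [0.0, 1.6, 0.0, 2.5]})

# True for nonzero spenders; True sums as 1
n_card_holders = int((toy["CCAvg"] > 0).sum())
print(n_card_holders)  # → 2
```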

ZIPCode¶

During our Exploratory Data Analysis, we ignored ZIPCode as its raw numerical value did not provide any meaningful insights. However, we will inquire further to extract relevant information from it.

This will be done via the uszipcode library, which provides information on US zipcodes. Since our bank is based in the US, we shall assume that most, if not all, customers in our dataset hold US zipcodes rather than foreign values.

In [13]:
# Let's see first how many unique zipcodes are in the data
unique_zipcodes = data["ZIPCode"].unique().shape[0]
print(f"There are {unique_zipcodes} unique zipcodes.")
There are 467 unique zipcodes.
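Equivalently, `Series.nunique()` returns the count of distinct values directly, without materializing the array of uniques; a sketch on a toy series standing in for `data["ZIPCode"]`:

```python
import pandas as pd

# toy stand-in with one repeated zipcode
toy = pd.Series([91107, 90089, 91107, 94720])
print(toy.nunique())  # → 3
```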

Now we will work with uszipcode, specifically with its main instance of SearchEngine() that allows us to obtain information about the zipcode from its numerical value.

In [14]:
# Creating the instance of SearchEngine()
zip_search = SearchEngine()
Download /root/.uszipcode/simple_db.sqlite from https://github.com/MacHu-GWU/uszipcode-project/releases/download/1.0.1.db/simple_db.sqlite ...
  1.00 MB downloaded ...
  2.00 MB downloaded ...
  3.00 MB downloaded ...
  4.00 MB downloaded ...
  5.00 MB downloaded ...
  6.00 MB downloaded ...
  7.00 MB downloaded ...
  8.00 MB downloaded ...
  9.00 MB downloaded ...
  10.00 MB downloaded ...
  11.00 MB downloaded ...
  Complete!

Let us observe the range of values given to a single zipcode.

In [15]:
# Using the method of the instance to search the zipcode
# and giving the value in the dictionary type
zip_search.by_zipcode(94116).to_dict()
Out[15]:
{'zipcode': '94116',
 'zipcode_type': 'STANDARD',
 'major_city': 'San Francisco',
 'post_office_city': 'San Francisco, CA',
 'common_city_list': ['San Francisco'],
 'county': 'San Francisco County',
 'state': 'CA',
 'lat': 37.74,
 'lng': -122.48,
 'timezone': 'America/Los_Angeles',
 'radius_in_miles': 2.0,
 'area_code_list': '415',
 'population': 43698,
 'population_density': 16901.0,
 'land_area_in_sqmi': 2.59,
 'water_area_in_sqmi': 0.04,
 'housing_units': 16283,
 'occupied_housing_units': 15445,
 'median_home_value': 734400,
 'median_household_income': 83407,
 'bounds_west': -122.510407,
 'bounds_east': -122.458635,
 'bounds_north': 37.764001,
 'bounds_south': 37.733771}

From all the values returned, we will only use a few. Additionally, since some of these fields are missing for certain zipcodes, we will keep only the values that reliably represent a zipcode. These values are the following:

  • Zipcode
  • Major City
  • State
  • County
  • Latitude
  • Longitude

During our analysis, if any of these values do not hold any meaningful insights into our objective, we will drop these attributes from the analysis moving forward. To ease our analysis, we will make a dataset from all of the unique and valid zipcodes.

In [16]:
# We will create an empty dataframe for the zipcodes
# and a temporary list to hold the values
zipcodes = pd.DataFrame()
temp_list = []

# Cycling through all the unique zipcodes in the dataset
for zc in data["ZIPCode"].unique():
  # the try block will execute the commands inside
  # only if nothing raises an exception
  try:
    # We search for the specific zipcode
    zip_info = zip_search.by_zipcode(zc)
    # creating a dictionary filled with the zipcode's values
    zip_dict = {
        "ZIPCode": zc,
        "Major_City": zip_info.major_city,
        "State": zip_info.state,
        "County": zip_info.county,
        "Latitude": zip_info.lat,
        "Longitude": zip_info.lng
        }
    # the dictionary will be appended to the temporary list
    temp_list.append(zip_dict)
  # If an exception occurred, such as an invalid zipcode,
  # the except block records the searched zipcode with empty values
  # and appends them to the list
  except Exception:
    zip_dict = {
        "ZIPCode": zc,
        "Major_City": None,
        "State": None,
        "County": None,
        "Latitude": np.NaN,
        "Longitude": np.NaN
        }
    temp_list.append(zip_dict)

# When all the zipcodes are searched for,
# the temporary list is incorporated into the dataframe created.
zipcodes = pd.DataFrame(temp_list)
# printing the dataframe to validate the operation
zipcodes
Out[16]:
ZIPCode Major_City State County Latitude Longitude
0 91107 Pasadena CA Los Angeles County 34.16 -118.08
1 90089 Los Angeles CA Los Angeles County 34.02 -118.29
2 94720 Berkeley CA Alameda County 37.87 -122.25
3 94112 San Francisco CA San Francisco County 37.72 -122.44
4 91330 Northridge CA Los Angeles County 34.25 -118.53
... ... ... ... ... ... ...
462 90068 Los Angeles CA Los Angeles County 34.13 -118.33
463 94970 Stinson Beach CA Marin County 37.91 -122.65
464 90813 Long Beach CA Los Angeles County 33.78 -118.18
465 94404 San Mateo CA San Mateo County 37.55 -122.26
466 94598 Walnut Creek CA Contra Costa County 37.91 -122.01

467 rows × 6 columns

With our new dataset of zipcodes, let's run a simple overview before analysing its values.

ZIPCode Analysis¶

In [17]:
zipcodes.describe(include="all").T
Out[17]:
count unique top freq mean std min 25% 50% 75% max
ZIPCode 467.0 NaN NaN NaN 93077.233405 1806.56932 90005.0 91743.0 93022.0 94605.0 96651.0
Major_City 463 244 Los Angeles 35 NaN NaN NaN NaN NaN NaN NaN
State 463 1 CA 463 NaN NaN NaN NaN NaN NaN NaN
County 463 38 Los Angeles County 116 NaN NaN NaN NaN NaN NaN NaN
Latitude 463.0 NaN NaN NaN 35.671102 2.14791 32.55 33.94 34.39 37.74 41.76
Longitude 463.0 NaN NaN NaN -119.816911 2.070952 -124.11 -122.02 -118.95 -118.005 -115.65

Observations:

  • There are 244 unique cities, with the most frequent being Los Angeles.
  • The latitude ranges from 32.55° to 41.76°, with an average of ~35.67°.
  • The longitude ranges from -124.11° to -115.65°, with an average of ~-119.82°.
  • There are 38 unique counties, with the most popular being Los Angeles County.
  • We can see that all of the zipcodes are located in CA, therefore we will drop that column from the dataset.

A simple glance tells us that the majority of the customers in this sample hail from Los Angeles, CA. However, we must delve deeper to conclude if these values are relevant to our prediction procedures.

In order to get relevant insights from these values, we must append them to the original dataframe.

In [18]:
# Before we append, let's see its metadata
zipcodes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 467 entries, 0 to 466
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ZIPCode     467 non-null    int64  
 1   Major_City  463 non-null    object 
 2   State       463 non-null    object 
 3   County      463 non-null    object 
 4   Latitude    463 non-null    float64
 5   Longitude   463 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 22.0+ KB

We can observe that there are missing values in the data in the form of null objects. Our decision on how to handle them will be made and applied in Data Preprocessing.

Now, we shall combine the two datasets into one using Pandas' merge() function. Thankfully, since both datasets use the same values for ZIPCode, merging them is straightforward.

In [19]:
# Merging the two datasets with "ZIPCode" as their anchor.
# In order to avoid mismanipulation of our dataset, let us make a copy
data_w_zip = pd.merge(data, zipcodes, on="ZIPCode", copy=True)
# returning the dataset to validate the procedure.
data_w_zip
Out[19]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Major_City State County Latitude Longitude
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0 Pasadena CA Los Angeles County 34.16 -118.08
1 456 30 4 60 91107 4 2.2 2 0 0 0 0 1 0 Pasadena CA Los Angeles County 34.16 -118.08
2 460 35 10 200 91107 2 3.0 1 458 0 0 0 0 0 Pasadena CA Los Angeles County 34.16 -118.08
3 576 54 30 93 91107 1 2.7 2 0 0 0 0 1 0 Pasadena CA Los Angeles County 34.16 -118.08
4 955 37 12 169 91107 2 5.2 3 249 1 0 0 1 0 Pasadena CA Los Angeles County 34.16 -118.08
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 4081 27 0 40 90068 1 2.0 2 110 0 0 0 0 1 Los Angeles CA Los Angeles County 34.13 -118.33
4996 4347 45 21 33 94970 3 0.5 1 136 0 0 1 1 1 Stinson Beach CA Marin County 37.91 -122.65
4997 4624 50 25 45 90813 2 0.6 3 0 0 0 0 0 0 Long Beach CA Los Angeles County 33.78 -118.18
4998 4802 34 10 88 94404 2 0.0 1 121 0 0 0 1 0 San Mateo CA San Mateo County 37.55 -122.26
4999 4868 38 12 61 94598 4 0.2 3 0 0 0 0 1 0 Walnut Creek CA Contra Costa County 37.91 -122.01

5000 rows × 19 columns

Now that we have a dataset with zipcode values, we can employ this to reflect insights on our customer samples.

Note: During the analysis, we will use both data and data_w_zip, only using the latter to observe interactions with the zipcode, in order to reduce unnecessary processing and keep the plots simple.
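As a safeguard, `pd.merge` can also assert the expected many-to-one relationship between customers and zipcodes via its `validate` parameter, which raises if a zipcode appears more than once on the right side; a sketch on toy stand-ins for `data` and `zipcodes`:

```python
import pandas as pd

# toy stand-ins for the customer data and the zipcode lookup table
customers = pd.DataFrame({"ID": [1, 2, 3], "ZIPCode": [91107, 90089, 91107]})
zips = pd.DataFrame({
    "ZIPCode": [91107, 90089],
    "County": ["Los Angeles County", "Los Angeles County"],
})

# validate="many_to_one" raises MergeError if the right keys are not unique
merged = pd.merge(customers, zips, on="ZIPCode", validate="many_to_one")
print(len(merged))  # → 3 (every customer row kept, county attached)
```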

In [20]:
# Let us see which county is the most popular in our sample
# setting the size
plt.figure(figsize=(5,10))

sns.countplot(
    # selecting our combined dataset
    data=data_w_zip,
    # to provide better legibility, we shall plot on the Y axis instead of X
    y="County",
    # scaling the data to a percentage
    stat='percent',
    # ordering the data for better visualization
    order = data_w_zip["County"].value_counts(ascending=True).index
    );
# showing the plot
plt.show()

Observations:

  • Los Angeles County is by far the most common county, accounting for ~23% of the customers.
  • The next 3 most common counties are San Diego County and Santa Clara County, at around ~12% each, and Alameda County with ~10%.
In [21]:
# Since there are 244 unique major cities,
# instead of a plot we will visualize with a table (a Pandas Series, to be exact)
data_w_zip["Major_City"].value_counts(normalize=False).head(10)
Out[21]:
Major_City
Los Angeles 375
San Diego 269
San Francisco 257
Berkeley 241
Sacramento 148
Palo Alto 130
Stanford 127
Davis 121
La Jolla 112
Santa Barbara 103

Observations:

  • Los Angeles is the most common major city, with 375 customers.
  • The next 3 most common are San Diego, San Francisco and Berkeley, with 269, 257 and 241 customers respectively.
In [22]:
# Let us visualize the distribution of Latitude and Longitude
# to observe if there may be relevant information

# Setting the size
plt.figure(figsize=(10,5))
# creating a temporary list to cycle through with an incremental variable
temp_list = ["Latitude", "Longitude"]
i=1

# creating the loop for the attributes
for attribute in temp_list:
  # draws the plot in the "i"th position.
  plt.subplot(2,2,i)
  sns.histplot(data=data_w_zip, x=attribute, kde=True)
  # incrementing the position variable and drawing a second time
  i+=1
  plt.subplot(2,2,i)
  sns.boxplot(data=data_w_zip, x=attribute)
  i+=1

# shows the overall plot
plt.tight_layout()
plt.show()

Observations:

  • Latitude presents a right-skewed distribution, with most of the customers around 34°.
  • Longitude presents a milder right skew, closer to a multimodal distribution, with most customers around -122°.

Without further parameters, few inferences can be made from these two attributes. If they do not reveal relevant information, the case for removing them from the analysis grows stronger.

Multivariate Analysis¶

In this section we shall analyse attributes with one another to derive insights from them.

In [23]:
# setting the size
plt.figure(figsize=(15,10))

# creating the heatmap to visualize the correlation between numerical values.
sns.heatmap(
    # selecting only the numerical values that hold meaning
    data_w_zip.drop(columns=["ID","ZIPCode"]).corr(numeric_only=True),
    # displaying the correlation value
    annot=True,
    # variable for the colorpalette of the heatmap
    cmap="coolwarm",
    # minimum value
    vmin=-1,
    # maximum value
    vmax=1,
    # format to show the correlation
    fmt='.2f'
);
In [24]:
# setting the size
# (note: sns.pairplot creates its own figure, so this call has no effect,
# hence the empty Figure printed in the output)
plt.figure(figsize=(15,10))

# selecting only the numerical data
num_data = data_w_zip.drop(columns=["ID","ZIPCode", "Online", "CreditCard","Securities_Account", "CD_Account", "Family", "Major_City", "State", "County"])

# Drawing a pairplot of all the numerical data with Personal_Loan as a hue
sns.pairplot(num_data, hue="Personal_Loan");
<Figure size 1500x1000 with 0 Axes>

Observations:

  • With respect to Personal_Loan, Income is positively correlated; CCAvg and CD_Account hold smaller but still meaningful positive correlations, and Education and Mortgage hold small but relevant positive correlations with it.
  • Income and CCAvg hold very high positive correlation. Additionally, both Family and Education hold a small negative correlation with both of these attributes.
  • Education holds small positive correlation with Personal_Loan and Family. However, it also holds small negative correlation with Income and CCAvg.
  • CD_Account holds positive correlation, in descending order of magnitude, with Personal_Loan, Securities_Account, CreditCard, Online, Income, CCAvg and Mortgage.
  • Age and Experience, as well as Latitude and Longitude, are closely related to one another, hence their high mutual correlation values.

After seeing the correlation heatmap and the pairplot, we shall probe further into those that hold correlation for better analysis.

Note: Some of this analysis can be seen in the pairplot; however, due to its large size, we shall generate simpler plots for analysis.
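The heatmap's takeaways can also be ranked numerically by sorting the target column of the correlation matrix; a sketch on a toy numeric frame (the real call would use `data_w_zip`, as in the heatmap cell):

```python
import pandas as pd

# toy numeric frame standing in for the dataset
toy = pd.DataFrame({
    "Income": [40, 80, 120, 160],
    "CCAvg": [1.0, 1.5, 3.0, 4.0],
    "Personal_Loan": [0, 0, 1, 1],
})

# correlations with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["Personal_Loan"]
    .drop("Personal_Loan")
    .sort_values(ascending=False)
)
print(target_corr)
```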

Personal Loan¶

We will be referring to Personal Loan throughout this analysis; however, this section will emphasize the attributes that hold a positive correlation with Personal Loan.

In [25]:
# Visualizing the relationship between CCAvg, Income and Personal_Loan
sns.scatterplot(data=data, x="Income", y="CCAvg", hue="Personal_Loan");

Observations:

  • People below an income of 100k dollars tend to reject the personal loan, while those above it tend to accept it.
  • A correlation between Income and CCAvg can be visualized by the slope in the graph, indicating that as income increases, the average spent on credit cards tends to increase. A similar but smaller correlation can be seen between those who accepted the personal loan and credit card spending, denoted by the greater density of points along a partially covered slope.
    • Note: This does not imply that an increase in income causes higher average credit card spending.
In [26]:
# Visualizing the relationship between CD_Account, CCAvg and Personal_Loan
# in two graphs
plt.figure(figsize=(10,5))

# first plot
plt.subplot(1,2,1)
sns.boxplot(data=data, x="CD_Account", y="CCAvg", hue="Personal_Loan")
# second plot
plt.subplot(1,2,2)
sns.countplot(data=data, x="CD_Account", hue="Personal_Loan")

# showing the overall plot
plt.tight_layout()
plt.show()

Observations:

  • The correlation between CCAvg and Personal Loan can be seen here as well: 50% of those who accepted spend more than 3.5k dollars, in contrast with those who didn't, 75% of whom spend less than 3k dollars.
  • The correlation between CD_Account and Personal Loan can be seen in the counts: customers who do have a certificate of deposit are far fewer than those who do not.

We shall answer one of the questions proposed.

How does a customer's interest in purchasing a loan vary with their age?¶

In [27]:
# visualizing the relationship between age and Personal_Loan
# as two side-by-side plots.
# Setting the size
plt.figure(figsize=(10,5))

# drawing the first plot, a histogram
plt.subplot(1,2,1)
sns.histplot(data=data, x="Age", hue="Personal_Loan");
# drawing the second plot, a box plot
plt.subplot(1,2,2)
sns.boxplot(data=data, x="Age", hue="Personal_Loan");

plt.show()

Observations:

  • The distribution of those who accepted and those who rejected the loan does not change much, although the pronounced modes in the distribution are more subdued among those who accepted.
  • Those who accepted have a slightly higher minimum age and a slightly lower maximum age.
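The min/max comparison above can be made precise with a groupby aggregation; a sketch on a toy frame standing in for `data` (ages hypothetical):

```python
import pandas as pd

# toy stand-in for the dataset
toy = pd.DataFrame({
    "Age": [23, 30, 45, 60, 67, 26, 44, 65],
    "Personal_Loan": [0, 0, 1, 1, 0, 1, 0, 1],
})

# age range per loan outcome
age_range = toy.groupby("Personal_Loan")["Age"].agg(["min", "max"])
print(age_range)
```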

We will continue our analysis in order of appearance in the heatmap.

Family¶

In [28]:
# Visualizing the relationship between Family, CCAvg and Income
# in two graphs
plt.figure(figsize=(10,5))

# first plot
plt.subplot(1,2,1)
sns.boxplot(data=data, hue="Family", y="CCAvg", palette="Set2");
# second plot
plt.subplot(1,2,2)
sns.boxplot(data=data, hue="Family", y="Income", palette="Set2");

# showing the overall plot
plt.tight_layout()
plt.show()

Observations:

  • Previously, we saw that Family's correlations were not that strong, and that is reflected in these graphs, albeit subtly. 75% of families of sizes 3 and 4 have lower Income and CCAvg spending compared with families of sizes 1 or 2.
    • This comparison is easier to see for Income, due to its scale and standard deviation compared to CCAvg.

Education¶

In [29]:
# Visualizing the relationship between Education, CCAvg and Income
# in two graphs
plt.figure(figsize=(10,5))

# first plot
plt.subplot(1,2,1)
sns.boxplot(data=data, hue="Education", y="CCAvg", palette="Set2");
# second plot
plt.subplot(1,2,2)
sns.boxplot(data=data, hue="Education", y="Income", palette="Set2");

# showing the overall plot
plt.tight_layout()
plt.show()

Observations:

  • Similar to Family's correlations, Income and CCAvg values tend to decrease as Education advances from 1 to 2. Nonetheless, this correlation is weak.
In [30]:
# Visualizing the correlation between Personal Loan and Education
sns.countplot(data=data, x="Education", hue="Personal_Loan", stat="percent");
plt.show()

Observations:

  • Albeit slightly, the proportion of customers who accepted the loan increases as Education increases.
  • Correspondingly, the proportion of customers who rejected the loan decreases as Education increases.

Mortgage¶

In [31]:
# Visualizing the relationship between Mortgage, Income and Personal_Loan
sns.scatterplot(data=data, x="Mortgage", y="Income", hue="Personal_Loan");
In [32]:
# Since the relationship of mortgage and personal loan is hard to visualize,
# we shall use a kde plot to simplify its density
sns.kdeplot(data=data, x="Mortgage", hue="Personal_Loan", common_norm=False);

Observations:

  • We previously saw that most customers have no mortgage (a value of 0), which slightly weakens the correlation.
  • For those who do have a mortgage, Income tends to increase as Mortgage increases.
  • Customers who accepted the loan and have a mortgage tend to have higher Mortgage values than those who have a mortgage but rejected the loan.

CD_Account¶

CD_Account has positive correlations with other attributes that we have yet to discuss. We shall analyze those attributes in this section.

In [33]:
# Visualizing CD_Account's relationship with the other attributes
# Setting the size
plt.figure(figsize=(12,7))
# Temporary list to hold the attributes
temp_list = ["Securities_Account", "Online", "CreditCard"]

# cycle through the attributes to draw its plot
for i, attribute in enumerate(temp_list):
  plt.subplot(2,2,i+1)
  sns.countplot(data=data, x=attribute, hue="CD_Account", stat="percent");

# showing the overall plot
plt.tight_layout()
plt.show()

Observations:

  • Customers who use internet facilities and do not own this bank's credit card are more likely to have a certificate of deposit (CD).
  • Among customers who have a CD, roughly as many have a securities account as do not. However, customers who do not have a CD tend not to have a securities account; the same can be said for credit cards from other banks.
In [34]:
# Let us see which county is the most popular in our sample
# setting the size
plt.figure(figsize=(10,10))

sns.countplot(
    # selecting our combined dataset
    data=data_w_zip,
    # to provide better legibility, we shall plot on the Y axis instead of X
    y="County",
    # scaling the data to a percentage
    stat='percent',
    # ordering the data for better visualization
    order = data_w_zip["County"].value_counts().index,
    hue = "Personal_Loan"
    );
# showing the plot
plt.show()

Observations:

  • The distribution hardly changes; most notably, more customers from Santa Clara county accepted the loan than from San Diego county.

This suggests that employing ZIP-code-derived values may not heavily influence the model.

Data Preprocessing¶

Previously, we looked at missing values during the data overview. However, after merging in the values derived from the ZIP codes, some ZIP codes turned out to be invalid. Let's review those.

Missing Value Treatment¶

In [35]:
# extracting null values, as in missing meaningful values
data_w_zip.isnull().sum()
Out[35]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
Major_City 34
State 34
County 34
Latitude 34
Longitude 34

In [36]:
# Counting how many entries contain our dependant variable: Personal_Loans
# We search for those with NaN values in Latitude,
# select our desired column, "Personal_Loan"
# and count the values.
data_w_zip[data_w_zip["Latitude"].isna()]["Personal_Loan"].value_counts()
Out[36]:
Personal_Loan
0 31
1 3

Of the 5000 entries, 34 contain an error due to the ZIP code; only 3 of those accepted the loan. Since this is a small number of missing values, we will drop these rows.

In [37]:
# dropping those values
data_w_zip.dropna(inplace=True)
# verifying the procedure
data_w_zip.isnull().sum()
Out[37]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
Major_City 0
State 0
County 0
Latitude 0
Longitude 0

Feature Engineering¶

Now, we will continue by removing columns that are meaningless for our model. The following attributes bear little or no value to the model:

  • ID: The identifier of the customer. Since we have no information relevant to the identifier, we shall drop this column.
  • Age & Experience: By themselves, these do not appear to hold any correlation to the other variables, making them seem irrelevant. We shall experiment with these during our model building.
  • ZIPCode: By itself, the numerical value of ZIPCode brings no further insights into our model.
  • Latitude & Longitude: The impact of the location of the customer has little if any relevance to the loans in our analysis.
  • State: Since all of the customers in our sample are from CA, this column is redundant. Furthermore, one can infer that the model that we build is based solely on CA. It is best to avoid that inference.
  • Major_City & County: The abundance of unique values in these columns introduces a lot of noise into our model. Additionally, since we only have CA data, we would be overfitting to CA customers. We shall drop these attributes from the modeling as well.

Note: These inferences are subjective to some extent, so a different analyst might disagree with them.

In [38]:
# Let's remove those columns from the dataset.
# We shall make a new variable for this new dataset.
final_data = data_w_zip.drop(columns=["ID", "ZIPCode", "Latitude", "Longitude", "State", "Major_City", "County"])
# Verifying the procedure
final_data.head()
Out[38]:
Age Experience Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 4 1.6 1 0 0 1 0 0 0
1 30 4 60 4 2.2 2 0 0 0 0 1 0
2 35 10 200 2 3.0 1 458 0 0 0 0 0
3 54 30 93 1 2.7 2 0 0 0 0 1 0
4 37 12 169 2 5.2 3 249 1 0 0 1 0

Outlier Detection and Treatment¶

Before we get our final preparations, we have to review the outliers and reflect on them.

In [39]:
# observing the distribution of the columns via a boxplot

# setting the size of the overall figure
plt.figure(figsize=(10, 5))

# making a list of all the data columns
# excluding ID, ZIPCode and the binary columns (those that hold either 0 or 1)
# since their numerical value does not provide any meaningful insight
data_col = final_data.drop(columns=["Personal_Loan", "Education", "Securities_Account", "CD_Account", "Online", "CreditCard"]).columns.to_list()

# plotting the boxplot
# for each attribute of the dataset from the list
# and appending it to the overall figure
for i, attribute in enumerate(data_col):
  plt.subplot(2,3, i+1)
  sns.boxplot(data = final_data, x = attribute)

# showing the figure tightly
plt.tight_layout()
plt.show()
In [40]:
# Let us focus the graphs on the outliers
# using the subplot "routine" we have been using
plt.figure(figsize=(15,5))
outliers = ["Income", "CCAvg", "Mortgage"]

for i, attribute in enumerate(outliers):
  plt.subplot(2,2, i+1)
  sns.boxplot(data = final_data, x = attribute)

plt.tight_layout()
plt.show()

From the continuous attributes, we detect outliers in three of them: Income, CCAvg, and Mortgage. Let's discuss them:

  • Income & CCAvg: The outliers here result from the natural distribution of the data. There are two clusters of outliers to mention: in Income they lie around and below 200k dollars, while in CCAvg they lie around and below 9k dollars. These sets of outliers represent genuine information. However, there is a small cluster in both of these columns above the main outlier cluster; let's investigate those values further.
  • Mortgage: The outliers result from the most common Mortgage value being 0, which skews the distribution. These outliers represent actual information, so we will keep these values.
In [41]:
data[ data["Income"] > 200].sort_values(by="Income", ascending=False)
Out[41]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
3896 3897 48 24 224 93940 2 6.67 1 0 0 0 1 1 1
4993 4994 45 21 218 91801 2 6.67 1 0 0 0 0 1 0
526 527 26 2 205 93106 1 6.33 1 271 0 0 0 0 1
2988 2989 46 21 205 95762 2 8.80 1 181 0 1 0 1 0
677 678 46 21 204 92780 2 2.80 1 0 0 0 0 1 0
2278 2279 30 4 204 91107 2 4.50 1 0 0 0 0 1 0
4225 4226 43 18 204 91902 2 8.80 1 0 0 0 0 1 0
2101 2102 35 5 203 95032 1 10.00 3 0 1 0 0 0 0
3804 3805 47 22 203 95842 2 8.80 1 0 0 0 0 1 0
787 788 45 15 202 91380 3 10.00 3 0 1 0 0 0 0
3608 3609 59 35 202 94025 1 4.70 1 553 0 0 0 0 0
1711 1712 27 3 201 95819 1 6.33 1 158 0 0 0 1 0
1901 1902 43 19 201 94305 2 6.67 1 0 0 1 0 1 0
2337 2338 43 16 201 95054 1 10.00 2 0 1 0 0 0 1
2447 2448 44 19 201 95819 2 8.80 1 0 0 0 0 1 1
4895 4896 45 20 201 92120 2 2.80 1 0 0 0 0 1 1
In [42]:
data[ data["CCAvg"] > 9].sort_values(by="CCAvg", ascending=False)
Out[42]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
787 788 45 15 202 91380 3 10.0 3 0 1 0 0 0 0
2101 2102 35 5 203 95032 1 10.0 3 0 1 0 0 0 0
2337 2338 43 16 201 95054 1 10.0 2 0 1 0 0 0 1
3943 3944 61 36 188 91360 1 9.3 2 0 1 0 0 0 0

Observations:

  • The higher outliers in CCAvg also tend to have high Income, which makes their spending amounts fairly reasonable.
  • The higher outliers in Income also tend to be outliers in CCAvg. However, since we do not have further domain knowledge about these customers, we shall leave these outliers untreated.

Now that the outliers have been addressed, we shall prepare the data for modeling.

Data Preparation for Modeling¶

We shall split the data into two variables. One that holds only our response variable (dependent variable, in this case Personal_Loan) and another that holds the rest of the explanatory variables (independent variables). Afterwards, we shall split the data into two sets. One that holds data to train the model and another that holds data to test the model.
Since we do not have any categorical attributes with non-numerical values, we can avoid creating what are known as "dummy variables".
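For illustration only (using a hypothetical "Region" column that is not present in our data), this is what dummy-variable encoding with pandas would look like if we did have a non-numerical categorical attribute:

```python
import pandas as pd

# hypothetical example: a non-numerical categorical column "Region"
# (not from our dataset) converted into one dummy column per category
demo = pd.DataFrame({"Region": ["North", "South", "North", "West"]})
dummies = pd.get_dummies(demo, columns=["Region"])
print(dummies.columns.tolist())  # ['Region_North', 'Region_South', 'Region_West']
```

Each row then holds a 1 in exactly one of the dummy columns, which is the numerical form a decision tree can work with.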

In [43]:
# Defining the explanatory and response variables
X = final_data.drop(columns="Personal_Loan")
y = final_data["Personal_Loan"]

As we split the data, we will use a specified random state. The split itself is still random; fixing the random state simply ensures the same split each time, so the results are reproducible.
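A minimal sketch (on synthetic data, not our dataset) showing that a fixed random_state makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# small synthetic dataset: 10 samples, 2 features, balanced binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# two splits with the same random_state produce identical results
X_tr_a, X_te_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=69, stratify=y)
X_tr_b, X_te_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=69, stratify=y)

print(np.array_equal(X_tr_a, X_tr_b))  # True: the split is reproducible
```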

In [44]:
# Assigning our random state
RS = 69
In [45]:
# Splitting the data to train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    # this is the percentage of the data that will go to the test set
    test_size=0.2,
    # The random state as defined previously
    random_state=RS,
    # stratify will split the response variable proportionally between training and test,
    # preserving the ratio of classes across the different sets.
    stratify = y
)

Let us visualize the split:

In [46]:
# Printing the shape of the train and test independent variables
# as well as the division of our desired variable.
print(f"Shape of the Training Set: {X_train.shape}")
print(f"Shape of the Test Set: {X_test.shape}")
print("Percentage of response variables (classes) in training set:")
print(y_train.value_counts(normalize=True) * 100)
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True) * 100)
Shape of the Training Set: (3972, 11)
Shape of the Test Set: (994, 11)
Percentage of response variables (classes) in training set:
0    90.382679
1     9.617321
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    90.442656
1     9.557344
Name: Personal_Loan, dtype: float64

Model Building¶

Model Evaluation Criterion¶

Our objective in this project is to predict whether a liability customer will buy a personal loan, determine which attributes contribute to that decision, and identify segments of customers to target. The main model we will build is a decision tree, a classification model. Based on certain attributes, this algorithm estimates the probability of a customer buying the loan or not. However, we must establish criteria to evaluate the model.

The model can make the following predictions:

  • Correct Predictions:
    • Predict that a customer buys a loan. (True Positive, TP)
    • Predict that a customer does not buy a loan. (True Negative, TN)
  • Erroneous Predictions:
    • Predict that a customer will buy a loan when in reality, they do not buy the loan. (False Positive, FP)
    • Predict that a customer will not buy a loan when in reality, they actually buy the loan. (False Negative, FN)

These outcomes will be visualized in a table called the "Confusion Matrix", which we will see later on.
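As a small illustration of this layout, scikit-learn's confusion_matrix arranges the four counts with actual classes as rows and predicted classes as columns; a minimal sketch with toy labels (not our data):

```python
from sklearn.metrics import confusion_matrix

# toy example: 4 actual labels vs. 4 predictions
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [0 2]]
```

Here one negative is correctly rejected (TN), one is wrongly flagged (FP), and both positives are caught (TP), with no misses (FN).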

Let us discuss the evaluation metrics for our decision tree. Four metrics are commonly used to evaluate classification models:

  • Accuracy: measures the proportion of correct predictions out of all predictions.
  • Recall: the ratio of True Positives to total actual positives.
    • Higher scores mean fewer False Negatives.
  • Precision: similar to Recall, the ratio of True Positives to total predicted positives.
    • Higher scores mean fewer False Positives.
  • F1 Score: maintains a balance between False Positives and False Negatives; in other words, it balances Precision and Recall.

Given their algebraic definitions, Precision aims to reduce False Positives while Recall aims to reduce False Negatives. In this specific case, however, neither is given greater importance than the other. Therefore we will aim to maximize the F1 Score without prioritizing either of the other two metrics.
Note: This does not suggest that Precision and Recall are unimportant; it only states that we will not maximize one at the expense of the other.
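Because F1 is the harmonic mean of Precision and Recall, maximizing it penalizes an imbalance between the two. A minimal sketch with toy labels (not our data) verifying this relationship:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# toy labels: 2 TP, 1 FN, 1 FP, 2 TN
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)  # TP=2, FP=1 -> 2/3
r = recall_score(y_true, y_pred)     # TP=2, FN=1 -> 2/3
f1 = f1_score(y_true, y_pred)

# F1 equals the harmonic mean of precision and recall
print(abs(f1 - 2 * p * r / (p + r)) < 1e-12)  # True
```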

With this in mind, we will create a function that takes a decision tree as input and displays these metrics, as well as another function to display the confusion matrix.

In [49]:
# Defining a function to compute 4 different metrics
# in order to monitor performance of a decision tree
def model_performance_metrics(model, predictors, target):
  """
  Function to compute 4 different metrics to evaluate classification model performance

  model: decision tree
  predictors: independent variables
  target: dependent variable
  """

  # predicting with the model using the independent variables.
  predictions = model.predict(predictors)

  # Based on the predictions, compute the different metrics
  accuracy = accuracy_score(target, predictions)
  recall = recall_score(target, predictions)
  precision = precision_score(target, predictions)
  f1 = f1_score(target, predictions)

  # creating a dataframe to simplify its visualization
  df_model_performance = pd.DataFrame(
      {
      "Accuracy": accuracy,
      "Recall": recall,
      "Precision": precision,
      "F1": f1
      },
      index=[0]
  )

  # returns the dataframe created
  return df_model_performance
In [51]:
# Defining a function to illustrate the confusion matrix
# of a decision tree in order to visualize the predictions.
def plot_confusion_matrix(model, predictors, target):
  """
  Function to plot or visualize the confusion matrix of a decision tree

  model: decision tree
  predictors: independent variables
  target: dependent variable
  """

  # predicting with the model using the independent variables
  predictions = model.predict(predictors)

  # creating a confusion matrix using the dependent variable
  # and the model predictions
  con_matrix = confusion_matrix(target, predictions)

  # formatting the values from the matrix
  labels = np.asarray(
      [
          ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / con_matrix.flatten().sum())]
            for item in con_matrix.flatten()
      ]
  ).reshape(2,2)

  # Setting the plot's size
  plt.figure(figsize=(6,4))

  # plotting the matrix through a heatmap
  sns.heatmap(con_matrix, annot=labels, fmt="")

  # setting the x and y labels for the matrix
  plt.xlabel("Predicted Class")
  plt.ylabel("Actual Class")

  # displaying the plot
  plt.show()

Model Building¶

Now, we shall build the first model. Whether this model will be the final model will be decided based on the evaluation metrics. After building the first model, we will create new models with different parameters, hyperparameters and other techniques to avoid overfitting and underfitting; those models will be in Model Performance Improvement.

In [50]:
# Creating the first model. Technically speaking,
# creating the decision tree instance
# Using the random state defined earlier.
model_1 = DecisionTreeClassifier(random_state=RS)

# Now that the instance is created,
# we will fit the tree to the training data
model_1.fit(X_train, y_train)
Out[50]:
DecisionTreeClassifier(random_state=69)

Now that the model has been trained with the training data, we will observe the confusion matrix and the evaluation metrics for this model.

In [52]:
# Printing the confusion matrix and the metrics of the training data
plot_confusion_matrix(model_1, X_train, y_train)
model_performance_metrics(model_1, X_train, y_train)
Out[52]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

As we can see, the model perfectly identified the True Negatives (90.38% of customers) and the True Positives, our class of interest (9.62% of customers). Observe that there are no errors, meaning the evaluation metrics have reached their maximum values.
Now we will evaluate the same model on the test data.

In [53]:
# Printing the confusion matrix and the metrics of the test data
plot_confusion_matrix(model_1, X_test, y_test)
model_performance_metrics(model_1, X_test, y_test)
Out[53]:
Accuracy Recall Precision F1
0 0.971831 0.831579 0.868132 0.849462

With the test data, new errors appear in the predictions. Since performance on the training data was perfect compared to the test data, this is an example of overfitting; in other words, the model is better at predicting the training data and is not well suited to unseen data. This is an issue we will attempt to solve during the improvements.

Before we continue; as we will be printing the confusion matrix with the evaluation scores multiple times, a new function will be created to simplify this process.

In [59]:
# creating a function to simplify the execution of this code
def evaluate_model(model, predictors, target):
  plot_confusion_matrix(model, predictors, target)
  # as the performance metric is a variable,
  # we will return it as well to be displayed easily
  return model_performance_metrics(model, predictors, target)
In [58]:
# Verifying this process
evaluate_model(model_1, X_test, y_test)
Out[58]:
Accuracy Recall Precision F1
0 0.971831 0.831579 0.868132 0.849462

Model Performance Improvement¶

The second model will be built with a new hyperparameter: during the creation of the model, we will set class_weight="balanced". This automatically adjusts the class weights to be inversely proportional to the class frequencies in the input data.

In [60]:
# creating the second model with balanced class weight
model_2 = DecisionTreeClassifier(random_state=RS, class_weight="balanced")

# fitting the model to the training data
model_2.fit(X_train, y_train)
Out[60]:
DecisionTreeClassifier(class_weight='balanced', random_state=69)
In [63]:
# Evaluating the model
evaluate_model(model_2, X_train, y_train)
Out[63]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [62]:
# Now let's evaluate on the test data
evaluate_model(model_2, X_test, y_test)
Out[62]:
Accuracy Recall Precision F1
0 0.977867 0.831579 0.929412 0.877778

Observations:

  • The model perfectly classifies the data points in the training set. This indicates that the model is overfitting.
  • Precision improved in comparison to the first model. Further comparisons will be made once all the models are built, specifically in "Model Performance Comparison and Final Model Selection".

Now, we will apply what are known as pruning techniques to reduce overfitting.

Pre-Pruning Decision Tree¶

Pre-pruning a decision tree limits its growth through hyperparameters (set, like class_weight="balanced", before the tree is fit to the training dataset). This restriction lowers performance on the training data, which reduces overfitting. In turn, the constrained tree adapts better to unseen data, improving performance on the test set.

We will create a loop that searches three of the most common hyperparameters for values that improve the F1 Score, since we are not aiming to maximize either Recall or Precision on its own.
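As a side note, scikit-learn's GridSearchCV can express a similar search; a hedged sketch on synthetic data (our manual loop below differs in that it compares train and test F1 directly rather than cross-validating):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic, imbalanced stand-in data (the real notebook would use X_train, y_train)
X_demo, y_demo = make_classification(n_samples=300, weights=[0.9], random_state=69)

# a reduced grid over the same three hyperparameters
param_grid = {
    "max_depth": [2, 3, 5],
    "max_leaf_nodes": [20, 30],
    "min_samples_split": [20, 30],
}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=69),
    param_grid,
    scoring="f1",  # optimize F1, as in our manual loop
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Cross-validated search is generally the more robust option, at the cost of not directly controlling the train/test score gap the way our loop does.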

In [70]:
# Define the parameters of the tree to iterate

# The maximum depth of the tree
# in a range from 2 to 10 (inclusive), in steps of 1
max_depth_values = np.arange(2, 11, 1)
# The maximum number of leaf nodes
max_leaf_nodes_values = [20, 30, 40, 50, 75, 100]
# The minimum number of samples in a node to split it
min_sample_split_values = [20, 30, 50, 70, 100]

# Initialize variables to store the best model and its f1 performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iteration over all the combination of hyperparameters
for max_depth in max_depth_values:
  for max_leaf_nodes in max_leaf_nodes_values:
    for min_sample_split in min_sample_split_values:

      # Initialize a tree with the current hyperparameters in the loop
      estimator = DecisionTreeClassifier(
          # max depth hyperparameter
          max_depth = max_depth,
          # max leaf hyperparameter
          max_leaf_nodes=max_leaf_nodes,
          # min samples split hyperparameter
          min_samples_split=min_sample_split,
          # setting the balanced class hyperparameter
          class_weight='balanced',
          # The Random State constant
          random_state=RS
      )

      # With this estimator, we will fit the model to the training data
      estimator.fit(X_train, y_train)

      # Predicting with the training and test data
      y_train_prediction = estimator.predict(X_train)
      y_test_prediction  = estimator.predict(X_test)

      # Calculate the f1 scores for the training and test sets
      train_f1_score = f1_score(y_train, y_train_prediction)
      test_f1_score  = f1_score(y_test, y_test_prediction)

      # Calculate the absolute difference between the training and test f1 scores
      score_diff = abs(train_f1_score - test_f1_score)

      # Update the best estimator and score
      # if the difference is smaller than the best difference
      # and if the current score is bigger than the best score
      if (score_diff < best_score_diff) and (test_f1_score > best_test_score):
        # updating the difference
        best_score_diff = score_diff
        # updating the score
        best_test_score = test_f1_score
        # updating the estimator
        best_estimator = estimator

# With the best estimator obtained, we will print its parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test F1 score: {best_test_score}")
Best parameters found:
Max depth: 5
Max leaf nodes: 20
Min samples split: 30
Best test F1 score: 0.8303571428571428

With the best parameters found from this iteration, we will create the instance from the best pre-pruned tree.

In [71]:
# creating an instance of the best model
# with the estimator created from the iteration
model_3 = best_estimator

# fitting the model to the training data
model_3.fit(X_train, y_train)
Out[71]:
DecisionTreeClassifier(class_weight='balanced', max_depth=5, max_leaf_nodes=20,
                       min_samples_split=30, random_state=69)
In [72]:
# Let's evaluate this model with trained data
evaluate_model(model_3, X_train, y_train)
Out[72]:
Accuracy Recall Precision F1
0 0.960473 1.0 0.70872 0.829533
In [73]:
# Let's evaluate this model with test data
evaluate_model(model_3, X_test, y_test)
Out[73]:
Accuracy Recall Precision F1
0 0.961771 0.978947 0.72093 0.830357

Observe that the model now produces comparable results on the training and test data, showing that it is less overfitted.

Post-Pruning Decision Tree¶

Post-pruning consists of modifying an already trained tree to reduce its size and complexity. This is mainly accomplished with the cost-complexity parameter ccp_alpha: higher values of this parameter prune more nodes. To find the best tree, we will look for a point where cost and nodes pruned are balanced; in other words, just before the point where further pruning starts losing value.

We will use DecisionTreeClassifier.cost_complexity_pruning_path to observe the effective alphas and the total leaf impurities. As alpha increases, the tree gets pruned further, increasing the impurity.

In [77]:
# let us create a tree to prune
dec_tree = DecisionTreeClassifier(random_state=RS, class_weight="balanced")

# obtaining the path with the train data
path = dec_tree.cost_complexity_pruning_path(X_train, y_train)

# obtaining the alphas and the impurities
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities

# displaying the alphas and the impurities in a dataframe
pd.DataFrame(path).head(10)
Out[77]:
ccp_alphas impurities
0 0.000000e+00 -2.128723e-16
1 8.659121e-19 -2.120064e-16
2 8.813748e-19 -2.111250e-16
3 2.242094e-18 -2.088829e-16
4 2.412184e-18 -2.064707e-16
5 3.092543e-18 -2.033782e-16
6 3.231708e-18 -2.001465e-16
7 3.231708e-18 -1.969148e-16
8 3.487612e-18 -1.934272e-16
9 5.334637e-18 -1.880925e-16
In [78]:
# From the dataframe, we will visualize using a plot

# creating the figure and the axis
# while setting the size
fig, ax = plt.subplots(figsize=(10,5))
# plotting the alphas and the impurities
ax.plot(
    ccp_alphas[:-1],
    impurities[:-1],
    # the type of marker to use on the plot
    marker="o",
    # the drawstyle of the plot
    drawstyle="steps-post"
    )
# seting the label for the x axis
ax.set_xlabel("effective alpha")
# setting the label for the y axis
ax.set_ylabel("total impurity of leaves")
# setting the title of the plot
ax.set_title("Total Impurity vs effective alpha for training set")
# showing the plot
plt.show()

We will train decision trees using the effective alphas. The last of these values prunes the whole tree down to a single node; that tree is redundant to our analysis, so we will drop it.

In [80]:
# creating a list for all the trees to train
trees = []
# iterating over the alphas
for ccp_alpha in ccp_alphas:
  # Building a tree with the iterated value of alpha
  tree = DecisionTreeClassifier(
      random_state=RS,
      ccp_alpha=ccp_alpha,
      class_weight="balanced"
      )
  # training the tree with the training data
  tree.fit(X_train, y_train)
  # append the tree built into the list
  trees.append(tree)

# To help illustrate the redundancy
print(
    "Number of nodes in the last tree is: {}\n With ccp_alpha: {}".format(
        trees[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1
 With ccp_alpha: 0.28401785511222305

Now, let us visualize the depth of the tree and the number of nodes according to each alpha.

In [82]:
# Dropping the last tree since it is redundant
trees = trees[:-1]
ccp_alphas = ccp_alphas[:-1]

# obtaining the number of nodes for each tree in the list trees
node_counts = [tree.tree_.node_count for tree in trees]
# obtaining the depth for each tree in the list trees
depth = [tree.tree_.max_depth for tree in trees]

# Plotting the nodes and depths
# setting the size, while getting the figure and the axis
fig, ax = plt.subplots(2, 1, figsize=(12,7))

# settings for the first plot
# plotting the alphas with the number of nodes
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
# setting the label of x axis
ax[0].set_xlabel("alpha")
# setting the label of y axis
ax[0].set_ylabel("number of nodes")
# setting the title of the first plot
ax[0].set_title("Number of nodes vs alpha")

# settings for the second plot
# plotting the alphas with the depth of the trees
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
# setting the label of x axis
ax[1].set_xlabel("alpha")
# setting the label of y axis
ax[1].set_ylabel("depth of tree")
# setting the title of the second plot
ax[1].set_title("Depth vs alpha")

# showing the plots
fig.tight_layout()

As we can see, the depth of the tree and the number of nodes decrease as alpha increases. Finally, we will observe how increasing alpha affects the F1 score of the trees.

In [83]:
# initialize an empty list to hold the training scores
train_f1_scores = []

# Iterate through the trees
for tree in trees:
  # predict labels using training data
  y_train_prediction = tree.predict(X_train)
  # calculate the F1 score for the training set predictions
  f1_train = f1_score(y_train, y_train_prediction)
  # Append the F1 score to the list of training F1 scores
  train_f1_scores.append(f1_train)

# Repeating the same procedure with the test data
# initialize an empty list to hold the test scores
test_f1_scores = []

# Iterate through the trees
for tree in trees:
  # predict labels using test data
  y_test_prediction = tree.predict(X_test)
  # calculate the F1 score for the testing set predictions
  f1_test = f1_score(y_test, y_test_prediction)
  # Append the F1 score to the list of testing F1 scores
  test_f1_scores.append(f1_test)
In [84]:
# Plotting the f1 scores with respect to alpha
fig, ax = plt.subplots(figsize=(15,5))

# setting the x label
ax.set_xlabel("Alpha")
# setting the y label
ax.set_ylabel("F1 Score")
# setting the title
ax.set_title("F1 Score vs Alpha")

# plotting the training f1 scores first
ax.plot(
    ccp_alphas,
    train_f1_scores,
    marker="o",
    drawstyle="steps-post",
    label="train"
)

# plotting the test f1 scores afterwards
ax.plot(
    ccp_alphas,
    test_f1_scores,
    marker="o",
    drawstyle="steps-post",
    label="test"
)

# adding a legend to the plot
ax.legend();
In [86]:
# selecting the model with the highest test F1 score
# extracting the index of the highest test F1 score
index_best_model = np.argmax(test_f1_scores)

# Selecting the tree from the previous index
model_4 = trees[index_best_model]
# printing the tree
print(model_4)
DecisionTreeClassifier(ccp_alpha=0.000264322430759792, class_weight='balanced',
                       random_state=69)

Having selected the best tree from the post-pruning process, let us evaluate it.

In [87]:
evaluate_model(model_4, X_train, y_train)
Out[87]:
Accuracy Recall Precision F1
0 0.997231 1.0 0.97201 0.985806
In [88]:
evaluate_model(model_4, X_test, y_test)
Out[88]:
Accuracy Recall Precision F1
0 0.979879 0.884211 0.903226 0.893617

The results are comparable between the training and test data, indicating that the model generalizes well.
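One way to quantify how comparable the results are is the train-test F1 gap. A small sketch on synthetic stand-in data (in the notebook this check would use `model_4` with the bank's train/test splits):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# synthetic stand-in for the bank's train/test splits
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(ccp_alpha=1e-3, random_state=69).fit(X_tr, y_tr)

# a small gap suggests the pruned tree is not overfitting
gap = f1_score(y_tr, model.predict(X_tr)) - f1_score(y_te, model.predict(X_te))
print(f"Train-test F1 gap: {gap:.3f}")
```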

Model Performance Comparison and Final Model Selection¶

After creating various trees with different methods, we will evaluate and compare all of them with the purpose of choosing the best model. To do this, we will create a data frame holding their evaluation metrics.

In [92]:
# Performance comparison in training data

# computing performance metrics with our helper function
# and concatenating them into a new dataframe
model_comparisons_train = pd.concat(
    [
        # evaluating all the models to extract the metrics
        model_performance_metrics(model_1, X_train, y_train).T,
        model_performance_metrics(model_2, X_train, y_train).T,
        model_performance_metrics(model_3, X_train, y_train).T,
        model_performance_metrics(model_4, X_train, y_train).T,
    ],
    # concatenating along the columns (axis=1)
    axis=1
)

# creating the labels for the columns
model_comparisons_train.columns = [
    "Default Model",
    "Class_weight",
    "Pre-Pruned Tree",
    "Post_Pruned Tree"
]
In [93]:
# Performance comparison in testing data

# computing performance metrics with our helper function
# and concatenating them into a new dataframe
model_comparisons_test = pd.concat(
    [
        # evaluating all the models to extract the metrics
        model_performance_metrics(model_1, X_test, y_test).T,
        model_performance_metrics(model_2, X_test, y_test).T,
        model_performance_metrics(model_3, X_test, y_test).T,
        model_performance_metrics(model_4, X_test, y_test).T,
    ],
    # concatenating along the columns (axis=1)
    axis=1
)

# creating the labels for the columns
model_comparisons_test.columns = [
    "Default Model",
    "Class_weight",
    "Pre-Pruned Tree",
    "Post_Pruned Tree"
]

Now, let us visualize them.

In [94]:
print("Training Performance comparison")
model_comparisons_train
Training Performance comparison
Out[94]:
Default Model Class_weight Pre-Pruned Tree Post_Pruned Tree
Accuracy 1.0 1.0 0.960473 0.997231
Recall 1.0 1.0 1.000000 1.000000
Precision 1.0 1.0 0.708720 0.972010
F1 1.0 1.0 0.829533 0.985806
In [95]:
print("Testing Performance comparison")
model_comparisons_test
Testing Performance comparison
Out[95]:
Default Model Class_weight Pre-Pruned Tree Post_Pruned Tree
Accuracy 0.971831 0.977867 0.961771 0.979879
Recall 0.831579 0.831579 0.978947 0.884211
Precision 0.868132 0.929412 0.720930 0.903226
F1 0.849462 0.877778 0.830357 0.893617

Observations:

  • The Post-Pruned Tree shows high scores in recall, precision, and F1 on the test data.
  • Both the Pre- and Post-Pruned Trees obtained a perfect recall score on the training data.
  • The Pre-Pruned Tree obtained the lowest F1 score of all the trees.

Having compared and analysed all the trees, we choose the Post-Pruned Tree as our final model, as it shows good scores on both the training and test data.
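If the marketing department needs to reuse the chosen tree without retraining, it can be persisted with joblib (the library scikit-learn itself recommends for model persistence). A minimal sketch, using a freshly fitted tree on synthetic data as a stand-in for `model_4`:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for model_4 fitted on the bank data
X, y = make_classification(n_samples=100, random_state=0)
model = DecisionTreeClassifier(ccp_alpha=1e-4, random_state=69).fit(X, y)

# save the fitted tree to disk and load it back
path = os.path.join(tempfile.mkdtemp(), "personal_loan_tree.joblib")
joblib.dump(model, path)
reloaded = joblib.load(path)
```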

With our tree selected, let us observe its characteristics.

In [103]:
# Setting the size of the plot figure
plt.figure(figsize=(20, 10))

# Since we used the variable tree as an iterator,
# it might have overwritten the imported module.
# Let's reimport it under a different name.
from sklearn import tree as _tree

# Extracting the tree
out = _tree.plot_tree(
    model_4,
    # taking the feature names from the independent variables
    feature_names= list(X_train.columns),
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# Setting the arrows if they are not visible
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)

# Showing the plot
plt.show()

As we can see, it is still a large tree, so inspecting it node by node would not be very fruitful. Let's look instead at the model's important features.

In [104]:
# extracting the important features
importances = model_4.feature_importances_
indices = np.argsort(importances)

# setting the size of the plot
plt.figure(figsize=(12, 12))
# setting the title
plt.title("Feature importances")
# plotting the feature importances
plt.barh(
    range(len(indices)),
    importances[indices],
    color="violet",
    align="center"
    )
# setting the tick labels
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
# setting the x axis label
plt.xlabel("Relative Importance")
# showing the plot
plt.show()
  • In the post-pruned tree, the most important features in the model are:
    • Income
    • Family
    • CCAvg
    • Education

As we saw in our EDA, income was highly positively correlated with personal loan acceptance, so it was expected to be important in the model. However, family size was not as highly correlated with personal loan, which shows that correlation alone does not determine a feature's importance in the model.
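The same importances can also be read as a sorted table rather than a bar chart. A sketch on synthetic data with stand-in column names (in the notebook this would use `model_4.feature_importances_` and `X_train.columns`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data; the column names only mimic the bank features
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, random_state=0
)
cols = ["Income", "Family", "CCAvg", "Education"]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# importances as a Series sorted from most to least important
importances = pd.Series(tree.feature_importances_, index=cols).sort_values(
    ascending=False
)
print(importances)
```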

Actionable Insights and Business Recommendations¶


  • The model built can be used to predict whether a customer will accept or reject a personal loan.

    • Income, family size, and average credit card spending are the primary drivers of that prediction.
  • People with larger families (more than 2 in our analysis) tend to have lower incomes and lower average credit card spending.

  • Higher income is also associated with higher average credit card spending.

  • All of the sampled data comes from California (CA), so the model may be fitted mainly to that particular state; more regionally diverse data may be needed to provide further and broader insights.

  • The education of an individual matters more to their decision than their years of experience. Income also tends to decrease as customers hold higher levels of education.

  • People with a certificate of deposit tend to use online facilities more. They also tend to own foreign credit cards.

  • Many customers do not have any mortgage.
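To act on these insights, the marketing department could rank customers by their predicted probability of accepting a loan and target the top segment. A hypothetical sketch on synthetic stand-in data (in practice `model_4` would score new liability customers):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for liability customers; class 1 = accepts the loan
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
model = DecisionTreeClassifier(ccp_alpha=1e-3, random_state=69).fit(X, y)

# probability that each customer accepts the loan
proba = model.predict_proba(X)[:, 1]

# indices of the 50 customers most likely to convert, best first
top_targets = np.argsort(proba)[::-1][:50]
```

Ranking by probability rather than the hard 0/1 prediction lets the campaign size be tuned to the marketing budget.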